---
title: 2. Choose a Chunk Mode
---

After uploading content to the knowledge base, the next step is chunking and data cleaning. **This stage involves content preprocessing and structuring, where long texts are divided into multiple smaller chunks.**

<Accordion title="What is the Chunking and Cleaning Strategy?">
* **Chunking**

Because LLMs have a limited context window, the entire knowledge base cannot be processed and transmitted at once. Instead, long texts in documents must be split into content chunks. Even though some advanced models now support uploading complete documents, studies show that retrieval efficiency remains weaker compared to querying individual content chunks.

The ability of an LLM to accurately answer questions based on the knowledge base depends on the retrieval effectiveness of content chunks. This process is similar to finding key chapters in a manual for quick answers, without the need to analyze the entire document line by line. After chunking, the knowledge base uses a Top-K retrieval method to identify the most relevant content chunks based on user queries, supplementing key information to enhance the accuracy of responses.

The size of the content chunks is critical during semantic matching between queries and chunks. Properly sized chunks enable the model to locate the most relevant content accurately while minimizing noise. Overly large or small chunks can negatively impact retrieval effectiveness.

Dify offers two chunking modes: **General Mode** and **Parent-child Mode**, tailored to different document structures and application scenarios. These modes are designed to meet varying requirements for retrieval efficiency and accuracy in knowledge bases.

* **Cleaning**

To ensure effective text retrieval, it’s essential to clean the data before uploading it to the knowledge base. For instance, meaningless characters or empty lines can affect the quality of query responses and should be removed. Dify provides built-in automatic cleaning strategies. For more information, see [ETL](/en/guides/knowledge-base/create-knowledge-and-upload-documents#optional-etl-configuration).
</Accordion>

Whether an LLM can accurately answer knowledge base queries depends on how effectively the system retrieves relevant content chunks. High-relevance chunks are crucial for AI applications to produce precise and comprehensive responses.

In an AI customer chatbot scenario, for example, directing the LLM to the key content chunks in a tool manual is sufficient to quickly answer user questions—no need to repeatedly analyze the entire document. This approach saves tokens during the analysis phase while boosting the overall quality of the AI-generated answers.

### Chunk Mode

The knowledge base supports two chunking modes: **General Mode** and **Parent-child Mode**. If you are creating a knowledge base for the first time, we recommend choosing **Parent-child Mode**.

<Info>
**Please note**: The original **“Automatic Chunking and Cleaning”** mode has been automatically updated to **“General”** mode. No changes are required, and you can continue to use the default setting.

Once the chunk mode is selected and the knowledge base is created, it cannot be changed later. Any new documents added to the knowledge base will follow the same chunking strategy.
</Info>

![General mode and Parent-child mode](https://assets-docs.dify.ai/2024/12/b3052a6aae6e4d0e5701dde3a859e326.png)

#### General Mode

Content is divided into independent chunks. When a user submits a query, the system automatically calculates the relevance between each chunk and the query keywords. The top-ranked chunks are then retrieved and sent to the LLM, which uses them to generate the answer.
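
The ranking step can be illustrated with a minimal sketch. It assumes a toy relevance score (counting query words that occur in each chunk); the function names `tokenize`, `score`, and `top_k` are hypothetical and are not part of Dify, which relies on the configured indexing and retrieval methods instead.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query: str, chunk: str) -> int:
    """Toy relevance: how often the distinct query words occur in the chunk."""
    chunk_counts = Counter(tokenize(chunk))
    return sum(chunk_counts[word] for word in set(tokenize(query)))


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k highest-scoring chunks (ties keep their original order)."""
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    return ranked[:k]


chunks = [
    "To reset the device, hold the power button for ten seconds.",
    "The warranty covers manufacturing defects for two years.",
    "Hold the reset pin to restore factory settings on the device.",
]
# Only the two reset-related chunks would be passed on to the LLM.
results = top_k("how do I reset the device", chunks, k=2)
```

Only the selected chunks reach the LLM, which is why chunk boundaries and chunk size have a direct effect on answer quality.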

In this mode, you need to manually define text chunking rules based on different document formats or specific scenario requirements. Refer to the following configuration options for guidance:

*   **Chunk identifier**: The default value is `\n`, which means the text will be chunked by paragraphs. You can customize chunking rules using [regex](https://regexr.com/). The system automatically starts a new chunk whenever it detects the specified delimiter. For example, a delimiter that matches sentence endings chunks the text by sentences.

    ![Chunk results of different identifier syntaxes](https://assets-docs.dify.ai/2024/12/2c19c1c1a0446c00e3c07d6f4c8968e4.png)
* **Maximum chunk length:** Specifies the maximum number of text characters allowed per chunk. If this limit is exceeded, the system will automatically enforce chunking. The default value is 500 tokens, with a maximum chunk length of 4000 tokens.
* **Overlapping chunk length**: When data is chunked, adjacent chunks share a certain amount of overlapping text. This overlap helps preserve information that would otherwise be cut at chunk boundaries and improves retrieval results. A recommended setting is 10-25% of the maximum chunk length (in tokens).
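
How the three settings interact can be shown with a small sketch. This is an illustration under stated assumptions, not Dify's implementation: `chunk_text` is a hypothetical helper, and lengths are measured in characters for simplicity, whereas the settings above are expressed in tokens.

```python
import re


def chunk_text(text: str, delimiter: str = r"\n",
               max_len: int = 500, overlap: int = 50) -> list[str]:
    """Split on a regex delimiter, then enforce max_len with overlapping windows.

    Assumption: lengths are counted in characters, not tokens.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be non-negative and smaller than max_len")
    step = max_len - overlap
    chunks = []
    for segment in re.split(delimiter, text):
        segment = segment.strip()
        if not segment:
            continue  # skip empty pieces between consecutive delimiters
        start = 0
        while True:
            chunks.append(segment[start:start + max_len])
            if start + max_len >= len(segment):
                break
            start += step  # the next window re-reads the last `overlap` characters
    return chunks


# Short paragraphs stay whole; each becomes one chunk.
paragraph_chunks = chunk_text("one\ntwo\n\nthree")

# A long segment is cut into windows that share 2 characters with their neighbor.
alphabet_chunks = chunk_text("abcdefghijklmnopqrstuvwxyz", max_len=10, overlap=2)
# alphabet_chunks == ["abcdefghij", "ijklmnopqr", "qrstuvwxyz"]
```

With the default maximum of 500, the recommended 10-25% overlap corresponds to a value between 50 and 125.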

**Text Preprocessing Rules**: These rules help filter out irrelevant content from the knowledge base. The following options are available:

* Replace consecutive spaces, newline characters, and tabs
* Remove all URLs and email addresses
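
As a rough sketch of what these two options do, the example below applies them with regular expressions. The patterns and the `preprocess` function are assumptions made for illustration; the rules Dify applies internally may differ.

```python
import re

# Simplified patterns chosen for this example only.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def preprocess(text: str, collapse_whitespace: bool = True,
               remove_urls_emails: bool = True) -> str:
    """Apply the two optional cleaning rules to a piece of text."""
    if remove_urls_emails:
        text = URL_RE.sub("", text)
        text = EMAIL_RE.sub("", text)
    if collapse_whitespace:
        text = re.sub(r"[ \t]{2,}", " ", text)   # runs of spaces/tabs -> one space
        text = re.sub(r"\n{3,}", "\n\n", text)   # 3+ newlines -> one blank line
    return text.strip()


raw = "Contact   us at support@example.com or https://example.com/help\n\n\n\nThanks"
cleaned = preprocess(raw)
```

Removing URLs and email addresses is a trade-off: it reduces noise in most documents, but it should stay disabled if users are expected to ask for links or contact details.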

Once configured, click **“Preview Chunk”** to see the chunking results. You can view the character count for each chunk. If you modify the chunking rules, click the button again to view the latest generated text chunks.

If multiple documents are uploaded in bulk, you can switch between them by clicking the document titles at the top to review the chunk results for other documents.

![General mode](https://assets-docs.dify.ai/2024/12/b3ec2ce860550563234ca22967abdd17.png)

After setting the chunking rules, the next step is to specify the indexing method. General mode supports the **High-Quality Indexing Method** and the **Economical Indexing Method**. For more details, please refer to [Set up the Indexing Method](/en/guides/knowledge-base/create-knowledge-and-upload-documents/setting-indexing-methods).

#### Parent-child Mode

Compared to **General mode**, **Parent-child mode** uses a two-tier data structure that combines precise matching with richer contextual information.

In this mode, parent chunks (e.g., paragraphs) serve as larger text units that provide context, while child chunks (e.g., sentences) are used for pinpoint retrieval. The system searches child chunks first to ensure relevance, then fetches the corresponding parent chunk to supply the full context, so the final response is both accurate and grounded in complete background information. You can customize how parent and child chunks are split by configuring delimiters and maximum chunk lengths.

For example, in an AI-powered customer chatbot case, a user query can be mapped to a specific sentence within a support document. The paragraph or chapter containing that sentence is then provided to the LLM, filling in the overall context so the answer is more precise.

Its fundamental mechanism includes:

* **Query Matching with Child Chunks:**
  * Small, focused pieces of information, often as concise as a single sentence within a paragraph, are used to match the user's query.
  * These child chunks enable precise and relevant initial retrieval.
* **Contextual Enrichment with Parent Chunks:**
  * Larger, encompassing sections—such as a paragraph, a section, or even an entire document—that include the matched child chunks are then retrieved.
  * These parent chunks provide comprehensive context for the Language Model (LLM).

![Parent-child mode schematic](https://assets-docs.dify.ai/2024/12/3e6820c10bd7c5f6884930e3a14e7b66.png)
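
The two-step lookup described above can be sketched as follows. The example assumes a toy word-overlap score and sentence-level child chunks; `ChildChunk`, `build_index`, and `retrieve_parent` are hypothetical names used only for this illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class ChildChunk:
    text: str
    parent_index: int  # position of the parent chunk this child was cut from


def words(text: str) -> set[str]:
    """Return the distinct lowercase words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(parents: list[str]) -> list[ChildChunk]:
    """Split every parent chunk into sentence-level child chunks."""
    children = []
    for i, parent in enumerate(parents):
        for sentence in re.split(r"(?<=[.!?])\s+", parent):
            if sentence.strip():
                children.append(ChildChunk(sentence.strip(), i))
    return children


def retrieve_parent(query: str, parents: list[str],
                    children: list[ChildChunk]) -> str:
    """Match the query against child chunks, then return the owning parent."""
    query_words = words(query)
    best = max(children, key=lambda c: len(query_words & words(c.text)))
    return parents[best.parent_index]


parents = [
    "Our store opens at 9 am. It closes at 6 pm on weekdays.",
    "Refunds are issued within 14 days. Contact support to start a refund request.",
]
children = build_index(parents)
context = retrieve_parent("how do I start a refund request", parents, children)
```

The query matches a single sentence, but the LLM receives the whole paragraph, including the 14-day policy that the matched sentence alone does not mention.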

In this mode, you need to manually configure separate chunking rules for both parent and child chunks based on different document formats or specific scenario requirements.

**Parent Chunk**

The parent chunk settings offer the following options:

*   **Paragraph**

    This mode splits the text into paragraphs based on delimiters and the maximum chunk length, and uses each resulting paragraph as a parent chunk for retrieval. It is suitable for documents with large volumes of text, clear content, and relatively independent paragraphs. The following settings are supported:

    * **Chunk Delimiter**: The default value is `\n`, which chunks text by paragraphs. You can customize the chunking rules using [regex](https://regexr.com/). The system automatically chunks the text whenever the specified delimiter appears.
    * **Maximum chunk length:** Specifies the maximum number of text characters allowed per chunk. If this limit is exceeded, the system will automatically enforce chunking. The default value is 500 tokens, with a maximum chunk length of 4000 tokens.
*   **Full Doc**

    Instead of splitting the text into paragraphs, the entire document is used as the parent chunk and retrieved directly. For performance reasons, only the first 10,000 tokens of the text are retained. This setting is ideal for smaller documents where paragraphs are interrelated, requiring full doc retrieval.

![The difference between Paragraph and Full doc](https://assets-docs.dify.ai/2024/12/e3814336710d445a99a9ded3d251622b.png)
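
The difference between the two parent options can be sketched in a few lines. This is a simplified illustration, not Dify's implementation: `make_parent_chunks` is a hypothetical helper, whitespace-separated words stand in for tokens, and the Paragraph branch omits the maximum chunk length for brevity.

```python
import re

FULL_DOC_TOKEN_LIMIT = 10_000  # limit stated above for Full Doc mode


def make_parent_chunks(text: str, mode: str = "paragraph",
                       delimiter: str = r"\n",
                       token_limit: int = FULL_DOC_TOKEN_LIMIT) -> list[str]:
    """Build parent chunks for either parent option.

    Assumption: a "token" is approximated by a whitespace-separated word.
    """
    if mode == "full_doc":
        tokens = text.split()
        # One parent chunk holding only the first `token_limit` tokens.
        return [" ".join(tokens[:token_limit])]
    if mode == "paragraph":
        # One parent chunk per non-empty piece between delimiters.
        return [p.strip() for p in re.split(delimiter, text) if p.strip()]
    raise ValueError(f"unknown mode: {mode}")


doc = "Intro paragraph.\nSecond paragraph with details."
paragraph_parents = make_parent_chunks(doc, mode="paragraph")
full_doc_parents = make_parent_chunks(doc, mode="full_doc")
```

Paragraph mode yields one parent per paragraph, while Full Doc mode always yields a single parent, so any text beyond the limit is never returned as context.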

**Child Chunk**

Child chunks are derived from parent chunks by splitting them based on delimiter rules. They are used to identify and match the most relevant and direct information to the query keywords. When using the default child chunking rules, the segmentation typically results in the following:

* If the parent chunk is a paragraph, child chunks correspond to individual sentences within each paragraph.
* If the parent chunk is the full document, child chunks correspond to the individual sentences within the document.

You can configure the following chunk settings:

* **Chunk Delimiter**: The default delimiter chunks text by sentences. You can customize the chunking rules using [regex](https://regexr.com/). The system automatically chunks the text whenever the specified delimiter appears.
* **Maximum chunk length:** Specifies the maximum number of text characters allowed per chunk. If this limit is exceeded, the system will automatically enforce chunking. The default value is 200 tokens, with a maximum chunk length of 4000 tokens.

You can also use **text preprocessing rules** to filter out irrelevant content from the knowledge base:

* Replace consecutive spaces, newline characters, and tabs
* Remove all URLs and email addresses

After completing the configuration, click **“Preview Chunk”** to see the chunking results. You can see the total character count of the parent chunk. Characters highlighted in blue represent child chunks, and the character count for the current child chunk is also displayed for reference.

If you modify the chunking rules, click the button again to view the latest generated text chunks.

If multiple documents are uploaded in bulk, you can switch between them by clicking the document titles at the top to review the chunk results for other documents.

![Parent-child mode](https://assets-docs.dify.ai/2024/12/af5c9a68f85120a6ea687bf93ecfb80a.png)

To ensure accurate content retrieval, Parent-child mode supports only the [High-Quality Indexing](/en/guides/knowledge-base/create-knowledge-and-upload-documents/setting-indexing-methods#high-quality) method.

### What's the Difference Between the Two Modes?

The difference between the two modes lies in the structure of the content chunks. **General Mode** produces multiple independent content chunks, whereas **Parent-child Mode** uses a two-layer chunking approach. In this way, a single parent chunk (e.g., the entire document or a paragraph) contains multiple child chunks (e.g., sentences).

Different chunking methods influence how effectively the LLM can search the knowledge base. When used on the same document, **Parent-child Mode** provides more comprehensive context while maintaining high precision, making it significantly more effective than the traditional single-layer approach.

![The Difference Between General Mode and Parent-child Mode](https://assets-docs.dify.ai/2024/12/0b614c6a07c6ea2151fe17d85ce6a1d1.png)

### Reference

After choosing the chunking mode, refer to the following documentation to configure the indexing method and retrieval method and finish creating your knowledge base.


<Card title="Select the Indexing Method" icon="link" href="/en/guides/knowledge-base/create-knowledge-and-upload-documents/setting-indexing-methods">
See this guide for more details.
</Card>

{/*
Contributing Section
DO NOT edit this section!
It will be automatically generated by the script.
*/}


